---
title:  Failure Detection and Membership Views
---

<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<%=vars.product_name%> uses failure detection to remove unresponsive members from membership views.

## <a id="concept_CFD13177F78C456095622151D6EE10EB__section_1AAE6C92FED249EFBA476D8A480B8E51" class="no-quick-link"></a>Failure Detection

Network partitioning has a failure detection protocol that is not subject to hanging when NICs or machines fail. Failure detection has each member observe messages from the peer to its right within the membership view (see "Membership Views" below for the view layout). A member that suspects the failure of its peer to the right sends a datagram heartbeat request to the suspect member. With no response from the suspect member, the suspicious member broadcasts a `SuspectMembersMessage` datagram message to all other members. The coordinator attempts to connect to the suspect member. If the connection attempt is unsuccessful, the suspect member is removed from the membership view. The suspect member is sent a message to disconnect from the cluster and close the cache. In parallel to the receipt of the `SuspectMembersMessage`, a distributed algorithm promotes the leftmost member within the view to act as the coordinator, if the coordinator is the suspect member.

Failure detection processing is also initiated on a member if the `gemfire.properties` `ack-wait-threshold` elapses before receiving a response to a message, if a TCP/IP connection cannot be made to the member for peer-to-peer (P2P) messaging, and if no other traffic is detected from the member.

**Note:**
The TCP connection ping is not used for connection keep alive purposes; it is only used to detect failed members. See [TCP/IP KeepAlive Configuration](../monitor_tune/socket_tcp_keepalive.html#topic_jvc_pw3_34) for TCP keep alive configuration.

If a new membership view is sent out that includes one or more failed members, the coordinator will log new quorum weight calculations. At any point, if quorum loss is detected due to unresponsive processes, the coordinator will also log a severe level message to identify the failed members:
``` pre
Possible loss of quorum detected due to loss of {0} cache processes: {1}
```

in which {0} is the number of processes that failed and {1} lists the members (cache processes).

## <a id="concept_CFD13177F78C456095622151D6EE10EB__section_1170FBBD6B7A483AB2C2A837F1B8876D" class="no-quick-link"></a>Membership Views

The following is a sample membership view:

``` pre
[info 2012/01/06 11:44:08.164 PST bridgegemfire1 <UDP Incoming Message Handler> tid=0x1f] 
Membership: received new view  [ent(5767)<v0>:8700|16] [ent(5767)<v0>:8700/44876, 
ent(5829)<v1>:48034/55334, ent(5875)<v2>:4738/54595, ent(5822)<v5>:49380/39564, 
ent(8788)<v7>:24136/53525]
```

The components of the membership view are as follows:

-   The first part of the view (`[ent(5767)<v0>:8700|16]` in the example above) corresponds to the view ID. It identifies:
    -   the address and processId of the membership coordinator-- `ent(5767)` in example above.
    -   the view-number (`<vXX>`) of the membership view that the member first appeared in-- `<v0>` in example above.
    -   membership-port of the membership coordinator-- `8700` in the example above.
    -   view-number-- `16` in the example above
-   The second part of the view lists all of the member processes in the current view. `[ent(5767)<v0>:8700/44876,                             ent(5829)<v1>:48034/55334, ent(5875)<v2>:4738/54595,                             ent(5822)<v5>:49380/39564,                             ent(8788)<v7>:24136/53525]` in the example above.
-   The overall format of each listed member is:`Address(processId)<vXX>:membership-port/distribution                             port`. The membership coordinator is almost always the first member in the view and the rest are ordered by age.
-   The membership-port is the JGroups TCP UDP port that it uses to send datagrams. The distribution-port is the TCP/IP port that is used for cache messaging.
-   Each member watches the member to its right for failure detection purposes.

